13 research outputs found

    Assessing Data Usefulness for Failure Analysis in Anonymized System Logs

    System logs are a valuable source of information for analyzing and understanding system behavior with the aim of improving performance. Such logs contain various types of information, including sensitive information. Sensitive information can be extracted directly from system log entries or via the correlation of several log entries, or it can be inferred by combining the (non-sensitive) information contained within system logs with other logs and/or additional datasets. The analysis of system logs containing sensitive information compromises data privacy. Therefore, various anonymization techniques, such as generalization and suppression, have been employed over the years by data and computing centers to protect the privacy of their users, their data, and the system as a whole. Anonymization via generalization and suppression, however, may significantly decrease data usefulness and thus hinder the intended analysis of system behavior. Maintaining a balance between data usefulness and privacy preservation therefore remains an open and important challenge. Irreversible encoding of system logs using collision-resistant hashing algorithms, such as SHAKE-128, is a novel approach previously introduced by the authors to mitigate data privacy concerns. The present work studies the applicability of this encoding approach to the system logs of a production high performance computing system. Moreover, a metric is introduced to assess the usefulness of the anonymized system logs for detecting and identifying the failures encountered in the system. Comment: 11 pages, 3 figures, submitted to the 17th IEEE International Symposium on Parallel and Distributed Computing.
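
    The irreversible encoding described above relies on a collision-resistant hash such as SHAKE-128. As a minimal illustration (not the authors' actual tooling), the following Python sketch encodes the message part of hypothetical syslog entries with Python's hashlib while keeping timestamps in the clear; the 8-byte digest length and the field layout are assumptions.

        import hashlib

        def encode_entry(message: str, digest_bytes: int = 8) -> str:
            """Irreversibly encode a log message with SHAKE-128.

            The 8-byte digest length is an assumption; the original study may
            use a different output length.
            """
            return hashlib.shake_128(message.encode("utf-8")).hexdigest(digest_bytes)

        # Hypothetical syslog-like entries; timestamps stay in the clear so that
        # failures can still be located in time after anonymization.
        raw_entries = [
            ("2017-03-01T12:00:01", "kernel: EXT4-fs error on sda1"),
            ("2017-03-01T12:00:05", "sshd: Accepted publickey for alice"),
        ]

        anonymized = [(ts, encode_entry(msg)) for ts, msg in raw_entries]
        for ts, digest in anonymized:
            print(ts, digest)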

    Event Pattern Identification in Anonymized System Logs

    The size of computing systems and the number of their components steadily increase, and the volume of generated system logs grows in proportion. Storing system logs for analyzing and diagnosing system behavior in large computing systems requires a large amount of storage capacity. Sensitive data in system logs raise significant concerns about sharing and publishing them. Using anonymization methods to cleanse sensitive data in system logs before publication reduces the usability of the anonymized logs for further analysis: beyond a certain level of anonymization, the cleansed system logs lose their semantics and remain useful only for certain statistical analyses. In this work, we address this trade-off between anonymization and the usefulness of anonymized system logs such that full anonymization of system logs is guaranteed, minimal storage space is required, and the cleansed system logs remain usable for general statistical analyses. To this end: (1) all variables of every log entry are replaced with defined constant values; (2) each log entry is mapped to a hash key via a hash function that is resistant to hash-key collisions; (3) the frequency of each hash key is calculated; (4) the hash keys are optimized based on their frequency of appearance, and non-informative hash keys are eliminated. Preliminary results of analyzing system logs from a production system via the proposed method show up to a 95% reduction in required storage capacity, while the precision of the statistical analysis remains unchanged and full anonymity is guaranteed.
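
    A minimal sketch of the four steps listed above, assuming Python, illustrative regex patterns for the constantification step, and SHAKE-128 as the collision-resistant hash (named in the authors' related work); a real pipeline would need far more complete substitution rules.

        import hashlib
        import re
        from collections import Counter

        # Step (1): constantification -- replace variable fields with fixed
        # placeholders. The patterns are illustrative, not a complete rule set.
        PATTERNS = [
            (re.compile(r"\b\d{1,3}(\.\d{1,3}){3}\b"), "IP"),   # IPv4 addresses
            (re.compile(r"\b0x[0-9a-fA-F]+\b"), "HEX"),         # hexadecimal values
            (re.compile(r"\b\d+\b"), "NUM"),                    # decimal numbers
        ]

        def constantify(message: str) -> str:
            for pattern, constant in PATTERNS:
                message = pattern.sub(constant, message)
            return message

        # Step (2): map each cleansed entry to a hash key.
        def hash_key(message: str, digest_bytes: int = 8) -> str:
            return hashlib.shake_128(message.encode("utf-8")).hexdigest(digest_bytes)

        # Steps (3) and (4): count hash-key frequencies; storing only the
        # (key, count) pairs is where the storage reduction comes from.
        def summarize(messages):
            return Counter(hash_key(constantify(m)) for m in messages)

        logs = [
            "sshd[1234]: Accepted publickey from 10.0.0.5",
            "sshd[1235]: Accepted publickey from 10.0.0.7",
            "kernel: Out of memory: Kill process 4242",
        ]
        for key, count in summarize(logs).most_common():
            print(key, count)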

    Anomaly Detection in High Performance Computers: A Vicinity Perspective

    In response to the demand for higher computational power, the number of computing nodes in high performance computers (HPC) increases rapidly. Exascale HPC systems are expected to arrive by 2020. With the drastic increase in the number of HPC system components, a sharp increase in the number of failures is expected, which in turn poses a threat to the continuous operation of HPC systems. Detecting failures as early as possible and, ideally, predicting them is a necessary step to avoid interruptions in HPC system operation. Anomaly detection is a well-known general-purpose approach for failure detection in computing systems. The majority of existing methods are designed for specific architectures, require adjustments to the computing system's hardware and software, need excessive information, or pose a threat to users' and systems' privacy. This work proposes a node failure detection mechanism based on a vicinity-based statistical anomaly detection approach using passively collected and anonymized system log entries. Application of the proposed approach to system logs collected over 8 months indicates an anomaly detection precision between 62% and 81%. Comment: 9 pages, submitted to the 18th IEEE International Symposium on Parallel and Distributed Computing.
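
    The exact vicinity definition and statistics used in the paper are not spelled out in the abstract; the sketch below illustrates one plausible reading, comparing each node's log-event count with the median of its vicinity and flagging large deviations. Node names, counts, and the threshold of three times the vicinity median are assumptions made for illustration only.

        from statistics import median

        def detect_anomalies(event_counts, vicinities, threshold=3.0):
            """Flag nodes whose event count deviates strongly from their vicinity.

            event_counts: node name -> number of log events in a time window
            vicinities:   node name -> list of neighboring node names
            threshold:    multiple of the vicinity median treated as anomalous
                          (the value 3.0 is an assumption, not from the paper)
            """
            anomalous = []
            for node, count in event_counts.items():
                neighbors = [event_counts[n] for n in vicinities.get(node, ())
                             if n in event_counts]
                if not neighbors:
                    continue
                baseline = median(neighbors)
                if baseline > 0 and count > threshold * baseline:
                    anomalous.append(node)
            return anomalous

        # Hypothetical per-node event counts for one window and a vicinity
        # defined as "all other nodes in the same chassis".
        counts = {"n01": 12, "n02": 15, "n03": 14, "n04": 95}
        vicinity = {n: [m for m in counts if m != n] for n in counts}
        print(detect_anomalies(counts, vicinity))  # ['n04']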

    Turning Privacy Constraints into Syslog Analysis Advantage

    Nowadays, failures in high performance computers (HPC) have become the norm rather than the exception [10]. In the near future, the mean time between failures (MTBF) of HPC systems is expected to become so short that current failure recovery mechanisms, e.g., checkpoint-restart, will no longer be able to recover the systems from failures [1]. Early failure detection is a new class of failure recovery methods that can be beneficial for HPC systems with short MTBF. Detecting failures in their early stage can reduce their negative effects by preventing their propagation to other parts of the system [3]. The goal of the current work is to contribute to the foundation of failure detection techniques by sharing ongoing research with the community. Herein we consider user privacy as the main priority and turn the constraints applied to protect users' privacy into an advantage for analyzing system behavior. We use de-identification, constantification, and hashing to reach this goal. Our approach also contributes to the reproducibility and openness of future research in the field. Via this approach, system administrators can share their syslogs with the public domain without privacy concerns.
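
    Of the three techniques named above, the sketch below illustrates only the de-identification step, masking IP addresses and replacing user names with short hashed pseudonyms. The regular expressions, marker words, and pseudonym format are assumptions made for illustration, not the authors' actual rules.

        import hashlib
        import re

        # Assumed marker words after which a user name appears; real syslogs
        # need a much richer rule set.
        USER_RE = re.compile(r"(?<=user )\w+|(?<=for )\w+")
        IP_RE = re.compile(r"\b\d{1,3}(\.\d{1,3}){3}\b")

        def pseudonym(name: str) -> str:
            # Replace an identifier with a short, stable, irreversible token.
            return "user_" + hashlib.shake_128(name.encode("utf-8")).hexdigest(4)

        def deidentify(line: str) -> str:
            line = USER_RE.sub(lambda m: pseudonym(m.group(0)), line)
            line = IP_RE.sub("IP", line)
            return line

        print(deidentify("sshd: Failed password for alice from 192.168.1.20"))
        # -> sshd: Failed password for user_<token> from IP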

    Analysis of Node Failures in High Performance Computers Based on System Logs

    The growth in size and complexity of HPC systems leads to a rapid increase in their failure rates. In the near future, it is expected that the mean time between failures of HPC systems will become too short and that current failure recovery mechanisms will no longer be able to recover the systems from failures. Early failure detection is thus essential to prevent the destructive effects of failures. Based on measurements of a production system at TU Dresden over an 8-month time period, we study the correlation of node failures in time and space. We infer possible types of correlations and show that, in many cases, the observed node failures are directly correlated. The significance of such a study is twofold: achieving a clearer understanding of correlations between observed node failures and enabling failure detection as early as possible. The results are aimed at helping system administrators minimize (or prevent) the destructive effects of failures.
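
    One simple way to operationalize temporal correlation, sketched below under the assumption of a fixed correlation window (here 5 minutes, not a value from the paper), is to report pairs of failures on different nodes that occur within that window of each other; the failure records themselves are hypothetical.

        from datetime import datetime, timedelta
        from itertools import combinations

        WINDOW = timedelta(minutes=5)  # assumed correlation window

        def temporally_correlated(records, window=WINDOW):
            """Return pairs of failures on different nodes within `window` of each other."""
            pairs = []
            for (node_a, time_a), (node_b, time_b) in combinations(records, 2):
                if node_a != node_b and abs(time_a - time_b) <= window:
                    pairs.append(((node_a, time_a), (node_b, time_b)))
            return pairs

        # Hypothetical failure records: (node, time of detected failure).
        failures = [
            ("n101", datetime(2017, 5, 3, 10, 0)),
            ("n102", datetime(2017, 5, 3, 10, 2)),
            ("n250", datetime(2017, 5, 4, 18, 30)),
            ("n103", datetime(2017, 5, 3, 10, 4)),
        ]
        for pair in temporally_correlated(failures):
            print(pair)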

    Lessons learned from spatial and temporal correlation of node failures in high performance computers

    In this paper we study the correlation of node failures in time and space. Our study is based on measurements of a production high performance computer over an 8-month time period. We identify possible types of correlations between node failures and show that, in many cases, there are direct correlations between the observed node failures. The significance of such a study is twofold: achieving a clearer understanding of correlations between node failures and enabling failure detection as early as possible. The results of this study are aimed at helping system administrators minimize (or even prevent) the destructive effects of correlated node failures.
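
    The spatial side of the correlation study can be sketched similarly; the example below assumes, purely for illustration, that a node's rack (or island) can be read off its name, and reports racks with several failed nodes in the same observation period. Node names and the grouping rule are not taken from the paper.

        from collections import defaultdict

        # Hypothetical failed nodes in one observation period. The naming
        # scheme, where the digit after the prefix identifies the rack/island,
        # is an assumption made purely for illustration.
        failed_nodes = ["taurusi4001", "taurusi4003", "taurusi4017", "taurusi6120"]

        def rack_of(node: str) -> str:
            return node[len("taurusi")]  # first digit after the assumed prefix

        def spatially_correlated(nodes, min_failures=2):
            """Group failed nodes by rack and report racks with several failures."""
            by_rack = defaultdict(list)
            for node in nodes:
                by_rack[rack_of(node)].append(node)
            return {rack: group for rack, group in by_rack.items()
                    if len(group) >= min_failures}

        print(spatially_correlated(failed_nodes))
        # -> {'4': ['taurusi4001', 'taurusi4003', 'taurusi4017']}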

    Toward Resilience in High Performance Computing: A Prototype to Analyze and Predict System Behavior

    Following the growth of high performance computing (HPC) systems in size and complexity, and the advent of faster and more complex Exascale systems, failures have become the norm rather than the exception. Hence, the protection mechanisms need to be improved. Even the de facto mechanisms, such as checkpoint/restart or redundancy, may fail to support the continuous operation of future HPC systems in the presence of failures. Failure prediction is a new protection approach that is beneficial for HPC systems with a short mean time between failures. The failure prediction mechanism extends the existing protection mechanisms via the dynamic adjustment of the protection level. This work provides a prototype to analyze and predict system behavior using statistical analysis, to pave the way toward resilience in HPC systems. The proposed anomaly detection method is noise-tolerant by design and produces accurate results with as little as 30 minutes of historical data. Machine learning models complement the main approach and further improve the accuracy of failure predictions up to 85%. The fully automatic, unsupervised behavior analysis approach proposed in this work is a novel solution to protect future extreme-scale systems against failures.
    Contents:
    1 Introduction
      1.1 Background and Statement of the Problem
      1.2 Purpose and Significance of the Study
      1.3 Jam-e Jam: A System Behavior Analyzer
    2 Review of the Literature
      2.1 Syslog Analysis
      2.2 Users and Systems Privacy
      2.3 Failure Detection and Prediction
        2.3.1 Failure Correlation
        2.3.2 Anomaly Detection
        2.3.3 Prediction Methods
        2.3.4 Prediction Accuracy and Lead Time
    3 Data Collection and Preparation
      3.1 Taurus HPC Cluster
      3.2 Monitoring Data
        3.2.1 Data Collection
        3.2.2 Taurus System Log Dataset
      3.3 Data Preparation
        3.3.1 Users and Systems Privacy
        3.3.2 Storage and Size Reduction
        3.3.3 Automation and Improvements
        3.3.4 Data Discretization and Noise Mitigation
        3.3.5 Cleansed Taurus System Log Dataset
      3.4 Marking Potential Failures
    4 Failure Prediction
      4.1 Null Hypothesis
      4.2 Failure Correlation
        4.2.1 Node Vicinities
        4.2.2 Impact of Vicinities
      4.3 Anomaly Detection
        4.3.1 Statistical Analysis (frequency)
        4.3.2 Pattern Detection (order)
        4.3.3 Machine Learning
      4.4 Adaptive Resilience
    5 Results
      5.1 Taurus System Logs
      5.2 System-wide Failure Patterns
      5.3 Failure Correlations
      5.4 Taurus Failures Statistics
      5.5 Jam-e Jam Prototype
      5.6 Summary and Discussion
    6 Conclusion and Future Works
    Bibliography
    List of Figures
    List of Tables
    Appendix A Neural Network Models
    Appendix B External Tools
    Appendix C Structure of Failure Metadata Database
    Appendix D Reproducibility
    Appendix E Publicly Available HPC Monitoring Datasets
    Appendix F Glossary
    Appendix G Acronyms
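
    The thesis abstract above notes that the statistical anomaly detection method produces accurate results with as little as 30 minutes of historical data. A minimal sketch of such a short-history check is given below, scoring the newest per-node message count against the mean and standard deviation of a few preceding bins; the 10-minute bin size implied by the example values and the z-score threshold are assumptions, not details from the thesis.

        from statistics import mean, stdev

        def is_anomalous(history, current, z_threshold=3.0):
            """Score the newest per-node message count against a short history.

            history:     per-bin counts for the preceding window, e.g. three
                         10-minute bins covering 30 minutes of data
            current:     message count of the newest bin
            z_threshold: z-score above which the bin is flagged (assumed value)
            """
            if len(history) < 2:
                return False  # not enough history to estimate variability
            mu, sigma = mean(history), stdev(history)
            if sigma == 0:
                return current != mu
            return (current - mu) / sigma > z_threshold

        # Hypothetical counts for one node: 30 minutes of history, then a burst.
        print(is_anomalous([40, 38, 42], 300))  # True
        print(is_anomalous([40, 38, 42], 45))   # False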
